To understand the formation, evolution, and function of complex systems, it is crucial to understand the 
internal organization of their interaction networks. Partly due to the impossibility of visualizing large 
complex networks, resolving network structure remains a challenging problem. Here we overcome this 
difficulty by combining the visual pattern recognition ability of humans with the high processing speed of 
computers to develop an exploratory method for discovering groups of nodes characterized by common 
network properties, including but not limited to communities of densely connected nodes. Without any 
prior information about the nature of the groups, the method simultaneously identifies the number of 
groups, the group assignment, and the properties that define these groups. The results of applying our 
method to real networks suggest the possibility that most group structures lurk undiscovered in the 
fast-growing inventory of social, biological, and technological networks of scientific interest. 

The highly structured internal organization of complex networks can both impact and reflect their dynamics 
and function 1. Previous work on identifying and studying this organization has focused mainly on network 
communities 2–13, which are subsets of nodes defined by the difference between their internal and external 
link density. To provide a fresh perspective on this problem, we seek to capture more general structures 
characterized by other network properties 14–17. For this purpose, we introduce the notion of structural groups, 
defined as subsets of nodes sharing common structural properties that set them apart from other nodes in the network. 
Using a given set of p > 1 node properties (such as centrality and spectral properties) as the coordinates for each 
node in the p-dimensional space $\mathbb{R}^p$, we identify structural groups as clusters of points in this node property space. 
Figure 1 shows an illustrative example of a network for which no standard network visualization shows clear 
group structure (Fig. 1a). However, an appropriate two-dimensional projection in the node property space reveals 
a hidden but unambiguous three-group structure (Fig. 1b), which can be used to generate a far more informative 
layout of the network (Fig. 1e). Application of existing community detection methods 18–27 is not expected to 
resolve these groups, since they are not distinguishable by link density alone (Fig. 1c). Nor is the direct 
application of existing clustering methods, whether in the full node property space or in a projection onto a 
lower-dimensional space, because unsupervised algorithms are known to misgroup points when the groups have 
widely different scatter sizes (Fig. 1d and Supplementary Fig. S1 online). Distinguishing 
structural groups may in general require a combination of two or more properties; Fig. 1b shows that the 
degree and the average degree of neighbors suffice for this example. It is difficult, however, to identify such a 
combination without knowing the groups a priori. 

Our approach overcomes these difficulties using the visual processing ability of a human user as an integral part 
of the analysis. The approach is based on visual analytics 28,29, which is conceptualized as exploratory statistics in 
which analytical reasoning is facilitated by a visual interactive interface. Humans generally outperform automated 
computer algorithms in visual recognition tasks, such as labeling images 30 and deciphering distorted texts, which 
form the basis of spam prevention systems and crowdsourcing for the digitization of old books 31. We exploit 
this capability by asking the user to inspect a selection of two-dimensional projections of the node property space 
for possible separation of nodes into groups. Since any projection could potentially reveal good separation of 
groups, we first consider the result of choosing these projections randomly. For two clusters of points with a gap 
between them in high dimension, the probability that the clusters are separable by a straight line in a random 
two-dimensional projection can be very small. This probability depends strongly on the "effective dimension" of the 
clusters. For example, if two Gaussian-distributed clusters of 100 points have their centers 6 units apart in the 
28-dimensional space, the probability is less than 0.001 if the variance of the clusters in every direction is one, but 
increases to about 0.017 if the variance is reduced by a factor of 10 in all but 10 orthogonal directions. We find that 



SCIENTIFIC REPORTS | 1 : 151 | DOI: 10.1038/srep00151 

www.nature.com/scientificreports 



[Figure 1 panel labels: (a) Ungrouped network layout; (b) Our method, horizontal axis: Degree] 

Figure 1 | Discovering hidden group structure beyond density-based communities. (a) Visualization of a network by the Gursoy-Atun algorithm 51, 
which attempts to place nodes uniformly while keeping the network neighbors close. This and other standard layout algorithms fail to disentangle 
the network and reveal any clear group structure. (b) Using our visual analytics method, a user can discover three structural groups (of sizes 150, 50, and 
30) without a priori information about the number of groups. The groups can be characterized by the degree and the neighbors' average degree, and at 
least two properties are necessary to resolve the entire group structure. (c) Even the most general community detection method 14 does not divide the 
network correctly. (d) The K-means algorithm 45, one of the most frequently used methods for general clustering problems, does not correctly capture 
the group structure when applied directly to the full node property space, even if the number of groups K = 3 is given. (e) Layout of the network 
using the discovered groups. For clarity, both panels (a) and (e) show only 10% of the links. 



the effective dimension is relatively small for the groups discovered in 
the networks considered here, most of them with dimension less than 
12 (out of 28) when defined as the minimum number of principal 
components required to account for 90% of the variance within the 
group. To further enhance the probability of separating groups, we 
sample random projections with a systematic bias (see Methods). 
This increases the separation probability for the example of 
Gaussian clusters above to around 0.68 for a single projection. If 
the user visually recognizes separation of nodes into groups in a 
two-dimensional projection, the group assignment is entered 
through a graphical interactive interface (Fig. 2a-d, Supplementary 
Video S1 online). The integration of the visual component allows the 
user not only to supervise the process, but also to learn and create 
intuition from taking part in the process, thus facilitating the search 
for unanticipated network structures. It also naturally accommodates 
an ultimate goal of clustering algorithms, which is to reproduce 
how a human would group a given set of points. 
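As a rough illustration of the projection step, the following Python sketch draws an unbiased, uniformly random two-dimensional subspace of the node property space and projects the nodes onto it. It is a minimal sketch, not the authors' software: the function names, the toy data, and the use of Gram-Schmidt orthogonalization on Gaussian vectors are assumptions made for illustration, and the systematic bias described in Methods is deliberately left out.

```python
import math
import random


def random_plane(p, rng):
    """Return two orthonormal vectors spanning a random 2-D subspace of R^p."""
    u = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm_u = math.sqrt(sum(a * a for a in u))
    u = [a / norm_u for a in u]
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Gram-Schmidt step: remove the component of v along u, then normalize.
    dot_uv = sum(a * b for a, b in zip(u, v))
    v = [b - dot_uv * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(b * b for b in v))
    v = [b / norm_v for b in v]
    return u, v


def project(points, u, v):
    """Map each p-dimensional point to its two coordinates in the plane (u, v)."""
    return [(sum(x * a for x, a in zip(pt, u)),
             sum(x * b for x, b in zip(pt, v))) for pt in points]


# Toy data (hypothetical): 5 nodes with p = 28 properties in the unit hypercube.
rng = random.Random(0)
p = 28
points = [[rng.random() for _ in range(p)] for _ in range(5)]
u, v = random_plane(p, rng)
coords = project(points, u, v)
```

Each entry of `coords` is the pair of plotting coordinates that a user would inspect for visible group separation.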

The chance of capturing a group structure is even further 
enhanced by the multiplicative effect of using more than one projection. 
Indeed, the separation probability in the example above rises 
from 0.68 to above 0.999 with just 7 projections. In general, for a 
given number L of random projections, the probability that all of 
these projections fail to separate a given pair of groups decreases to 
zero exponentially with L. After the user processes a given number L 
of projections, each node i in the network will be associated with a 
group assignment vector $a^{(i)}$ representing the user input (Fig. 2d). 
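The figures quoted above can be checked with a short calculation, sketched here in Python under the assumption that the L projections succeed or fail independently, each with the same single-projection probability. The function names are illustrative.

```python
def prob_at_least_one_separates(p_single, num_projections):
    """Probability that at least one of L independent projections separates the pair."""
    return 1.0 - (1.0 - p_single) ** num_projections


def min_projections(p_single, target):
    """Smallest L for which the separation probability exceeds the target.

    Assumes 0 < p_single <= 1 and target < 1, otherwise the loop would not end.
    """
    num = 1
    while prob_at_least_one_separates(p_single, num) <= target:
        num += 1
    return num


p7 = prob_at_least_one_separates(0.68, 7)   # failure probability is 0.32**7
needed = min_projections(0.68, 0.999)
```

Under this independence assumption the failure probability is $0.32^7 \approx 3.4 \times 10^{-4}$, so seven projections are the fewest that push the success probability above 0.999.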



Since we typically have a large number of distinct assignment vectors, 
we aggregate the corresponding nodes into a smaller, more meaningful 
number of structural groups by single-linkage hierarchical 
clustering 32. For this, we use the Hamming distance between the 
group assignment vectors of different nodes, $a^{(i)}$ and $a^{(j)}$, which in 
this case is the number of projections for which the user has placed 
those nodes in different groups. This results in a dendrogram that we 
can cut at a threshold distance d to obtain a grouping, in which being 
in different groups indicates that the user has placed these nodes in 
different groups in at least d out of L projections (Fig. 2e; 
Supplementary Video S1 online). To compare the different groupings 
obtained at different thresholds, we define the quality of grouping 
$Q_g$ by 

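$$Q_g = \frac{1}{q_g}\,\frac{\left\langle \lVert c_k - c_{k'} \rVert \right\rangle_{kk'}}{\left\langle \lVert x^{(i)} - c_{k_i} \rVert \right\rangle_i} \qquad (1)$$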

where vector $x^{(i)} = (x_1^{(i)}, \ldots, x_p^{(i)})$ represents node i in the property 
space, vector $c_k$ is the center of group k, index $k_i$ denotes the 
group to which node i belongs, and $\lVert \cdot \rVert$ defines the p-dimensional 
Euclidean distance. The ratio of the two bracketed quantities in Eq. 
(1) measures the average separation distance between groups (the 
average over all pairs of groups, denoted $\langle \cdot \rangle_{kk'}$) relative to the spread 
within individual groups (the average over all nodes, denoted $\langle \cdot \rangle_i$). 
This quantity is then normalized by a constant $q_g$, chosen to remove 
a systematic dependence of the quality of grouping on the number of 
Figure 2 | Our visual analytics method. (a) The p node properties $x_1^{(i)}, \ldots, x_p^{(i)}$ are computed for each node i in a given network of n nodes. The nodes are 
then represented as points in $\mathbb{R}^p$, which are projected onto a randomly chosen two-dimensional subspace. (b, c) Using a graphical interface, the user can 
either reject the projection (b), which indicates that there is no visible group separation, or indicate visible groups (c), which automatically assigns a group 
index to each node for that particular projection. (d) Repeating this for a given number of random projections, each node i is associated with a group 
assignment vector $a^{(i)}$, listing the group indices the user has assigned to node i. We used 30 projections for all results in this article. (e) Dendrogram 
obtained by clustering the vectors $a^{(1)}, \ldots, a^{(n)}$. Cutting the dendrogram at a threshold Hamming distance d produces a grouping for the network. 
(f) Quality of grouping $Q_g$ as a function of the threshold level d. The appropriate number of groups is determined to be K = 3 by thresholding at the $Q_g$ 
drop-off (dashed line). 



groups K (see Methods). As one lowers the threshold level, the quality 
of grouping $Q_g$ tends to drop sharply at a certain level (Fig. 2f). To 
obtain the maximum number of high-quality groups, we suggest 
choosing the group assignment, as well as the number of groups K, 
at the threshold level just above the largest drop in $Q_g$, which we call 
the $Q_g$ drop-off. 
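The aggregation and thresholding steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the published implementation: it relies on the fact that cutting a single-linkage dendrogram at threshold d gives the connected components of the graph that links every pair of nodes at Hamming distance below d, and the function names, toy assignment vectors, and toy quality values are hypothetical.

```python
def hamming(a, b):
    """Number of projections in which two nodes received different group indices."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cut_single_linkage(vectors, d):
    """Group nodes so that nodes in different groups are at Hamming distance >= d."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of nodes closer than the threshold (union-find).
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(vectors[i], vectors[j]) < d:
                parent[find(i)] = find(j)

    # Relabel the component roots as consecutive group indices 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def drop_off_level(levels, quality):
    """Return the level just above the largest drop in quality.

    `levels` must be sorted from highest to lowest threshold, and `quality[i]`
    is the grouping quality obtained at `levels[i]`.
    """
    drops = [quality[i] - quality[i + 1] for i in range(len(quality) - 1)]
    return levels[drops.index(max(drops))]


# Toy input (hypothetical): 5 nodes, L = 4 projections.
vectors = [(0, 0, 1, 0), (0, 0, 1, 0), (0, 1, 1, 0), (1, 1, 0, 1), (1, 1, 0, 1)]
labels_d2 = cut_single_linkage(vectors, d=2)
labels_d1 = cut_single_linkage(vectors, d=1)
level = drop_off_level([4, 3, 2, 1], [1.0, 0.95, 0.9, 0.3])
```

In the toy input, node 2 differs from nodes 0 and 1 in a single projection, so it joins their group at d = 2 but splits off at d = 1.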

Results 

We implemented our visual analytics method using the selection of 
p = 28 node properties listed in Table I, which encompasses important 
node attributes that capture local information, such as degree and 
clustering, and others that capture more global information, such as 
betweenness centrality and Laplacian eigenvectors. In particular, the 
eigenvectors of the Laplacian and of the normalized Laplacian allow 
the detection of communities 12,33–36 and bipartite or multipartite 
structures 37, respectively, as well as mixtures of these structures, 
which ensures that our method can detect group structures defined 
by link density as special cases. Using this set of properties for the 
example network of Fig. 1, we obtain the dendrogram shown in 
Fig. 2e. The number of groups for this network is found to be K = 3 
at the $Q_g$ drop-off (Fig. 2f), which agrees with the group separation 
visible in the projection shown in Fig. 1b. This accurately reflects the 
fact that the network was synthetically constructed from three distinct 
structural groups: the first two groups characterized by high (> 65) 
and low (≤ 55) prescribed degrees, respectively, but connected 
randomly otherwise, and the third group characterized by higher 
connection probability with internal nodes (0.3) than with external 
ones (0.1). This example illustrates that our method is capable of 
discovering not only group structures defined by link density, but 
also more general group structures, even when different types of 
structures coexist in the same network. Moreover, as shown in 
Fig. 3 for two-group benchmark networks, the visual analytics 
method is generally expected to outperform existing methods if the 
groups have different internal structures, in this case determined by 
their different degree distributions (see Methods). 
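To make the notion of node property coordinates concrete, the sketch below computes three of the local properties in Table I (degree, average degree of neighbors, and clustering coefficient) and rescales each to the unit interval. It is an illustrative Python sketch only: the adjacency-set representation, the function names, and the choice of min-max rescaling as the normalization are assumptions, and the remaining 25 properties are not covered.

```python
def local_properties(adj):
    """Degree, neighbors' average degree, and clustering coefficient per node.

    `adj` maps each node to the set of its neighbors (undirected, unweighted).
    """
    props = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        avg_nbr_deg = sum(len(adj[j]) for j in nbrs) / k if k else 0.0
        # Count the connected pairs among the neighbors of node i.
        nb = sorted(nbrs)
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nb[b] in adj[nb[a]])
        pairs = k * (k - 1) / 2
        clustering = links / pairs if pairs else 0.0
        props[i] = [float(k), avg_nbr_deg, clustering]
    return props


def normalize_unit_interval(props):
    """Rescale each property linearly to [0, 1] (constant properties map to 0)."""
    nodes = list(props)
    p = len(props[nodes[0]])
    lo = [min(props[i][j] for i in nodes) for j in range(p)]
    hi = [max(props[i][j] for i in nodes) for j in range(p)]
    return {i: [(props[i][j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
                for j in range(p)]
            for i in nodes}


# Toy network: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
raw = local_properties(adj)
scaled = normalize_unit_interval(raw)
```

Each row of `scaled` is then one point in the (here three-dimensional) unit hypercube that plays the role of the node property space.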

Figure 4 shows a visualization of the hierarchy of nested structural 
groups identified by applying our method to a selection of six 
real-world networks spanning different sizes and domains (Table II). 
To further characterize these groups, we rank the node properties 
based on a two-dimensional projection in which the discovered 
groups reveal maximal separation (see Methods). We then discard 
the low-ranking properties that have negligible effect on the group 
separation, keeping only those indicated under each panel. 
Surprisingly, while most groups cannot be identified using a single 
node property, the node structural groups are completely separated 
in this plane for four of the networks. The groups in three of the 
networks, the polbooks, netscience, and disease networks (Fig. 4d-f), 
are separated using two eigenvectors of the Laplacian matrix, 
suggesting that these groups could be similar to density-based 
communities detected by existing methods 2; when quantified by 
the Rand index 38, however, the similarity appears relatively low 
(Supplementary Fig. S2 online). The groups in a fourth network, 
the karate network (Fig. 4a), can also be separated in a plane, but 
this projection requires the use of 15 properties led by the average 
degree, average betweenness, and average subgraph centrality 39 of 
neighbors (see Table I notes for the definition). The groups in the 
other two networks, the adjnoun and football networks (Fig. 4b-c), 
are mostly but not completely separated in this two-dimensional 
representation. We emphasize that it is not necessary for all the 
groups to be separable in a single two-dimensional projection. In 
fact, while each such projection may only illuminate part of the 
hidden group structure (such as the separation between a single 
group and all the others), the multiplicative effect of integrating 
information from many random projections is what often reveals 
the full high-dimensional structure. 

Another remarkable feature of this approach is that, because we do 
not know in advance which properties define the groups we seek to 
identify, the visual analytics method simultaneously provides the 
answer to the question (the number and identity of the structural 
groups) along with the question itself (the properties that define 
these groups). Even when these properties are abstract, further 
analysis can easily reveal the nature of the network's internal 
organization. For example, consider the karate network, whose nodes are 
members of a karate club and whose links are interactions between two 
members in at least one context external to the club activities. 
The three structural groups identified in Fig. 4a correspond to (1) 
members who are central to the club and interact with many other 
members; (2) peripheral members interacting with only very few, but 
central, members; and (3) members forming a community connected 
Table I | Node properties used to generate our results. 

j        $x_j^{(i)}$ (jth property of node i) 
1, 2     The degree (a) of node i and the average degree of the neighbors of node i 
3        The clustering coefficient (b) of node i 
4        The average shortest path length from node i to all the other nodes 
5, 6     The betweenness centrality (c) of node i and the average of the same quantity over the neighbors of node i 
7, 8     The subgraph centrality (d) of node i and the average of the same quantity over the neighbors of node i 
9-18     The ith component of the eigenvector associated with the 2nd (smallest nonzero) through the 11th eigenvalue of the Laplacian matrix (e) 
19-28    The ith components associated with the 10 largest eigenvalues of the normalized Laplacian matrix (f) 

Here we consider only undirected and unweighted networks for simplicity. Each quantity is 
normalized to the unit interval [0, 1] before applying our analysis, which in this case reduces the 
node property space to the 28-dimensional unit hypercube. 
(a) The number of links attached to node i. 
(b) The fraction of pairs of neighbors of node i that are connected. 
(c) The number of shortest paths passing through node i. 
(d) The weighted sum of the number of closed paths in which node i participates 39. 
(e) The Laplacian matrix L is defined by $L_{ij} = -1$ if nodes i and j are connected, $L_{ij} = 0$ if they are not 
connected, and $L_{ij}$ equals the degree of node i if i = j. 
(f) The normalized Laplacian matrix is obtained by dividing each $L_{ij}$ by the degree of node i. 



to the rest of the network only through one central member 
(Supplementary Fig. S3 online). Incidentally, one of the groups we 
identify consists of nodes that are connected to those outside the 
group but to none within the group. This social group structure is 
markedly different from the well-studied eventual split of the club 
into two clubs 40. 

As an additional example, consider the football network, where 
nodes are college American football teams and links indicate matches 
played in the 2000 season. Although the teams are organized into 12 
conferences (including Independents), our method identifies 7 structural 
groups (Fig. 4c). As shown in Figs. 5a and 5b, groups 1 and 6 are 
characterized by the combination of high degrees, high subgraph 
centrality, and the same characteristics for their neighbors, while 
these two groups are distinct in clustering coefficient and some 
Laplacian eigenvectors. Low degrees and low subgraph centrality, 
as well as the same characteristics for the neighbors, distinguish 
groups 4 and 7 from others, while they differ in their clustering 
coefficient and a few Laplacian eigenvectors. Group 2 shows 
characteristics similar to those of group 1 in terms of subgraph centrality, but the 
mean shortest path distance is very high and the betweenness centrality 
of the neighbors is very low, reflecting the peripheral location of 
these nodes within the network. Many of the Laplacian eigenvectors 
contribute to the separation of the groups, which is consistent with 
the fact that a density-based community structure exists in addition 
to other group structures. In particular, groups 3 and 5 are communities 
that can only be distinguished by the differences in the 
Laplacian eigenvectors and clustering coefficient. Grouping together 
Big Twelve and Mountain West as well as Atlantic Coast and Big 
East, but splitting the Independents (Fig. 5c), this group structure 
captures a higher-level organization of the conferences which is 
determined by the geographic proximity of the teams (Fig. 5d). 
Similar geographical manifestation of network communities has 
recently been observed in the effective boundaries defined by human 
mobility in the US 41 and telecommunications in Great Britain 42. 

Discussion 

The structural groups identified by the visual analytics method are 
characterized by common network properties. This provides a 
foundation for the study of the interplay between form and function 
in complex networks, as network dynamics (and hence function) is 



[Figure 3 plot: adjusted Rand index R (vertical axis) versus $p_{\mathrm{out}}$ (horizontal axis, ticks from 0.05 to 0.25)] 



Figure 3 | Performance comparison for detecting density-based 
communities. Using benchmark networks consisting of two groups, we 
compare the performance of the visual analytics method against alternative 
methods, measured by the adjusted Rand index 38 R between the computed 
and the true groupings (see Methods for the details of our benchmarking 
procedure). Our method (red filled circles) finds the correct group 
assignment almost perfectly for inter-group connection probability 
$p_{\mathrm{out}}$ < 0.15, and performs reasonably well for larger values of $p_{\mathrm{out}}$. The 
mixture model method 14 (blue open circles) performs well for $p_{\mathrm{out}}$ < 0.10. 
The K-means algorithm 45 (green open squares) shows consistently low 
performance. The use of nonlinear kernels, dimensionality reduction 
based on principal component analysis, and alternative schemes for 
assigning weights for node properties (see Methods) led to improved 
performance only for small $p_{\mathrm{out}}$ values. One of the largest such 
improvements is shown here (purple open triangles). In contrast, replacing 
the human user with the K-means algorithm to process the 
two-dimensional projections in our method (see Methods) shows significantly 
better performance (orange open diamonds) than the direct application of 
the K-means variants to the node property space (green open squares and 
purple open triangles), although still worse than the visual analytics 
method (red filled circles). This demonstrates both the effectiveness of the 
multiple random projection approach and the advantage of the human 
interactive component over unsupervised algorithms. Each point in the 
plot is the average of R computed after removing two outliers (smallest and 
largest R) from a total of 20 network realizations. The visual analytics 
method is generally expected to outperform existing methods if the groups 
have different internal structures, in this case determined by their different 
degree distributions (see Methods). 



believed to be strongly influenced by network structure. The 
possibilities are extensive with our approach since the user has complete 
freedom to choose the set of p node properties. Within the wide range 
of possible structures expressible through these properties, the visual 
analytics method can help discover a specific group structure of 
interest and interpret it using a ranking of the node properties. 
The approach can be easily adapted to identify network structures 
defined by link rather than node characteristics 43. Moreover, it can be 
applied to networks whose nodes have quantifiable (but not necessarily 
structural) properties 44, such as age, income and level of education 
in the case of social networks, which remain elusive in existing 
network representations. Systematic benchmarking using synthetic 
networks shows that our method has advantages over existing methods 
in identifying density-based communities with distinct internal 
structures (red vs. blue curve in Fig. 3). Naturally, existing methods 
such as the one proposed in Ref. 14 may still be more effective in 
resolving specific networks not represented in our benchmarks. In 
finding general structural groups beyond density-based communities, 
the visual analytics method outperforms the direct application of 





[Figure 4 panels recoverable from the extraction: a karate (3 groups); b adjnoun (4 groups); c football (7 groups); e netscience (4 groups); f disease (2 groups). Surviving axis labels: $z_1$, $z_2$, Laplacian eigenvector, 2nd Laplacian eigenvector.] 

Top 3 node properties, panel a (karate) 
                                                    Coefficients 
                                                    $z_1$     $z_2$ 
1. Average degree of neighbors                       0.44     0.48 
2. Average betweenness centrality of neighbors      -0.44    -0.10 
3. Average subgraph centrality of neighbors         -0.44    -0.02 
(12 more node properties used) 

Top 3 node properties, panel b (adjnoun) 
                                                    Coefficients 
                                                    $z_1$     $z_2$ 
1. 11th Laplacian eigenvector                       -0.62     0.46 
2. 8th Laplacian eigenvector                         0.62     0.37 
3. 9th Laplacian eigenvector                        -0.21     0.44 
(2 more node properties used) 

Top 3 node properties, panel c (football) 
                                                    Coefficients 
                                                    $z_1$     $z_2$ 
1. Subgraph centrality                              -0.66     0.19 
2. Average subgraph centrality of neighbors          0.36     0.48 
3. 2nd Laplacian eigenvector                         0.34     0.46 

Figure 4 | Hierarchical group structures discovered for the networks in Table II. Each panel shows the network nodes plotted in a two-dimensional 
projection of the node property space. For panels (a)-(c) the two coordinate axes, $z_1$ and $z_2$, are linear combinations of a selection of node properties 
that capture most of the group separation, with corresponding coefficients listed in the table below each plot. Note that in panels (b) and (c) two 
dimensions are not sufficient to cleanly separate the groups, even though our method resolves this separation by combining multiple two-dimensional 
projections. For panels (d)-(f), only two properties are necessary to resolve the groups clearly. The plot on the right of each panel shows the quality 
of grouping $Q_g$ as a function of the hierarchical level measured by the Hamming distance. The groupings corresponding to the hierarchical levels in the 
blue part of this plot are indicated in the projections by the shades of blue. 



standard clustering algorithms in the full node property space (Fig. 1; 
Supplementary Fig. S1 online; red vs. green/purple curve in Fig. 3). 
This suggests that our approach also has potential to be an alternative 
for solving general high-dimensional clustering problems. The 
replacement of the human component in the visual analytics method 
with a simple heuristic based on K-means yields a fully objective 
unsupervised algorithm, which performs much better than various 
extensions of K-means directly applied to the full node property 
space (orange vs. green/purple curves in Fig. 3). This highlights 
the critical role played by the integrative analysis of clustering 
outputs from multiple projections. Although the visual analytics 
method converted to an unsupervised algorithm performs better 
than standard unsupervised approaches, the original formulation 
with the human component is still more effective (red vs. orange 
curve in Fig. 3). By combining the pattern recognition ability of 
humans with the processing capability of computers, our visual 
analytics method can resolve the internal organization of complex 
networks better than either of them alone. 

Methods 

Biased random projections. To enhance the probability of resolving group 
separation, we first choose each node property j with probability $r_j$ (while requiring 
a minimum of four properties) and generate a random projection using those 
selected properties. The probability $r_j$ is designed to reflect the relative importance 
of property j in separating the groups. We set $r_j := [v_j / \max_{j'}(v_{j'})]^{\alpha}$, where 
$v_j := \sum_k w_k v_{k,j}^2$, and $v_{k,j}$ denotes the jth component of the normalized basis vector 
for the kth (out of N) one-dimensional projections generated randomly and 
uniformly. The weights are given by $w_k := \max_i (z_{k,i+1} - z_{k,i}) \cdot (i/n) \cdot (1 - i/n)$, where 
$z_{k,1} < z_{k,2} < \cdots < z_{k,n}$ denote the ordered points in the kth projection for all n 
nodes in the network. The parameter $\alpha$ can be used to adjust the bias strength and was 
taken to be 2 in all computations. 
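The selection probabilities can be sketched in Python as below. This is a hedged illustration, not the published code: it assumes that the weight of each one-dimensional projection is the largest gap between consecutive projected points, discounted by the fractions of points on either side, that the property score sums these weights times the squared direction components, and that the four-property minimum is enforced by simply redrawing. The function names, the toy data, and the redrawing rule are assumptions.

```python
import math
import random


def selection_probabilities(points, num_lines, alpha, rng):
    """Estimate the selection probability of each property from random 1-D projections."""
    n, p = len(points), len(points[0])
    v = [0.0] * p
    for _ in range(num_lines):
        # Random unit vector: a normalized Gaussian sample is uniform on the sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        z = sorted(sum(x * c for x, c in zip(pt, direction)) for pt in points)
        # Weight: the largest gap between consecutive points, discounted when
        # the gap splits off only a small fraction i/n of the points.
        w = max((z[i] - z[i - 1]) * (i / n) * (1 - i / n) for i in range(1, n))
        for j in range(p):
            v[j] += w * direction[j] ** 2
    v_max = max(v)
    return [(vj / v_max) ** alpha for vj in v]


def choose_properties(r, rng, minimum=4):
    """Pick property j with probability r[j]; redraw until `minimum` are chosen."""
    while True:
        chosen = [j for j, rj in enumerate(r) if rng.random() < rj]
        if len(chosen) >= minimum:
            return chosen


# Toy data (hypothetical): 40 nodes, 6 properties; property 0 carries a clear gap.
rng = random.Random(1)
points = [[0.0 if i < 20 else 1.0] + [0.05 * rng.random() for _ in range(5)]
          for i in range(40)]
r = selection_probabilities(points, num_lines=200, alpha=2, rng=rng)
chosen = choose_properties(r, rng)
```

A random projection would then be generated within the subspace spanned by the properties listed in `chosen`.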

Controlling for group-size effect in $Q_g$. Since smaller groups naturally tend to 
have smaller within-group variations, the ratio of the averages in Eq. (1) increases 
with the number of groups K, even when the groups are not necessarily better 
separated. To correct for this bias, we define $Q_g$ by normalizing the ratio by its 
expected value $q_g$ for randomized groupings with the individual group sizes kept 
fixed. We estimated $q_g$ by averaging over 100 realizations. 
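A minimal Python sketch of this normalization follows. It assumes that the randomized groupings are obtained by shuffling the group labels among the nodes (which keeps every group size fixed) and that at least two groups are present. The function names and the toy points are illustrative.

```python
import math
import random


def separation_ratio(points, labels):
    """Mean distance between group centers divided by mean distance to own center."""
    groups = {}
    for pt, k in zip(points, labels):
        groups.setdefault(k, []).append(pt)
    centers = {k: [sum(col) / len(pts) for col in zip(*pts)]
               for k, pts in groups.items()}
    keys = list(centers)
    between = [math.dist(centers[a], centers[b])
               for i, a in enumerate(keys) for b in keys[i + 1:]]
    within = [math.dist(pt, centers[k]) for pt, k in zip(points, labels)]
    return (sum(between) / len(between)) / (sum(within) / len(within))


def quality_of_grouping(points, labels, realizations=100, rng=None):
    """Ratio for the given grouping divided by its mean over shuffled labelings."""
    rng = rng or random.Random()
    shuffled = list(labels)
    total = 0.0
    for _ in range(realizations):
        rng.shuffle(shuffled)  # a permutation of labels keeps group sizes fixed
        total += separation_ratio(points, shuffled)
    q_norm = total / realizations
    return separation_ratio(points, labels) / q_norm


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
quality = quality_of_grouping(points, labels, rng=random.Random(0))
```

A value well above 1 indicates that the grouping separates the points much better than a random grouping with the same group sizes.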

Two-group benchmark networks. For the benchmarking results shown in Fig. 3, we 
used networks having two groups, constructed as follows. In the larger group (150 
nodes), nodes are connected randomly, with the degree of each node fixed to a 
random integer chosen uniformly between 10 and 70. In the smaller group (50 nodes), 
node pairs are connected randomly with fixed probability $p_{\mathrm{in}}$. Across the two groups, 
node pairs are connected with probability $p_{\mathrm{out}}$. For a given $p_{\mathrm{out}}$, we choose $p_{\mathrm{in}}$ to match 
the average degree in the smaller group with the average internal degree in the larger 
group. The probability $p_{\mathrm{out}}$ is varied between 0 (two completely isolated groups) and 
40/150 ≈ 0.27 (no internal links in the smaller group), with $p_{\mathrm{out}}$ = 20/150 ≈ 0.13 
corresponding to the point at which the average internal and external degrees in the 
smaller group are equal. 
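The matching condition can be written out explicitly, as in the Python sketch below. It assumes that the average internal degree in the larger group is 40 (the mean of integers drawn uniformly from 10 to 70) and that a node in the smaller group has expected degree $49\,p_{\mathrm{in}} + 150\,p_{\mathrm{out}}$. The function name and the rounding tolerance are illustrative choices.

```python
def p_in_for(p_out, n_large=150, n_small=50, mean_degree_large=40.0):
    """Internal connection probability of the smaller group for a given p_out.

    Assumes the expected degree of a smaller-group node,
    (n_small - 1) * p_in + n_large * p_out, is matched to mean_degree_large.
    """
    p_in = (mean_degree_large - n_large * p_out) / (n_small - 1)
    if p_in < -1e-12 or p_in > 1.0:
        raise ValueError("p_out is outside the range that yields a valid p_in")
    return max(p_in, 0.0)  # absorb tiny negative rounding errors at the endpoint


p_in_mid = p_in_for(20 / 150)
internal_degree = 49 * p_in_mid        # expected internal degree, smaller group
external_degree = 150 * (20 / 150)     # expected external degree, smaller group
```

Under these assumptions the two endpoints quoted above are recovered: the internal probability reaches zero at 40/150, and the internal and external degrees are both 20 at 20/150.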



Table II | Networks analyzed. 

Dataset       n      m       K    $d_H$    $p_d$ 
karate        34     78      3    3        15 
polbooks      105    441     2    2        2 
adjnoun       112    425     4    2        5 
football      115    613     7    3        3 
netscience    379    914     4    4        2 
disease       516    1188    2    5        2 

We used the following datasets: karate, the social network of Zachary's Karate club 40; 
polbooks, a network of political books 49; adjnoun, a network of English words 36; football, a 
network of collegiate American football teams 2; netscience, a network of network scientists 36; 
and disease, a human disease network 50 (see Supplementary Table S1 online for a more detailed 
description of these networks). Here n is the number of nodes, m is the number of links, K is the 
number of structural groups at the $Q_g$ drop-off, $d_H$ is the Hamming distance at the $Q_g$ drop-off, and 
$p_d$ is the minimum number of node properties necessary to produce a two-dimensional projection in 
which most or all of the K discovered groups are visually separated. 
Figure 5 | Characterizing seven structural groups discovered in the football network. (a) Average node properties of the seven groups. Rows correspond 
to node properties and columns to groups (the colored disks at the top). Using the orange color-scale on the left, each cell shows the average node property 
of the group, relative to the network average and in units of the network standard deviation. (b) Node property distribution within each group. The seven 
groups in this plot are color-coded as in the disks at the top of panel (a). Small dots indicate the individual values for each node in the network, larger dots 
connected by lines indicate the group averages, and bars indicate the range of values for each group. All values are measured relative to the network average 
and in units of the network standard deviation. (c) Layout of the network with the structural groups indicated by circles, color-coded as in the other 
panels. The number and color on a node indicate the college football conference to which the corresponding team belongs, as listed at the bottom of the 
panel. (d) Geographic distribution of nodes (teams) over the US, color-coded by the structural groups as in panel (c). The fact that more than one 
conference is grouped together as groups 3, 4, and 5 can be interpreted in terms of the proximity of the teams' geographic location and its impact on the 
structure of the network. 



Benchmarking procedure. We used the two-group network described in the 
subsection above to compare performance of various methods for identifying the 
groups. For our visual analytics method, we used the node properties listed in Table I 
and generated 30 biased random projections. The threshold level for the resulting 
dendrogram was selected so as to produce two groups. In a few cases where a 
two-group threshold does not exist, we selected the threshold that results in the smallest 
possible number of groups above two. For the mixture model method 14, the number 
of groups was set to K = 2. For K-means 45, the algorithm was applied directly to the 
node property space with K = 2. For completeness, we also examined the 
performance of K-means using all possible combinations of choices for (i) kernel 46 
(linear, polynomial, Gaussian, or sigmoid); (ii) dimensionality reduction (projecting 
the data points in the node property space onto the 2, 5, 10, 15, or 20 leading principal 
components, or no reduction); and (iii) normalization (scaling each node property 
to have zero mean and unit variance, normalizing each property to the unit interval 
[0, 1], or no normalization). Scaling for zero mean and unit variance is equivalent to 
weighting each node property equally when measuring distances in the node property 
space, while normalizing to the unit interval ensures that all the node properties are 
distributed in the same range. For the unsupervised variant of our visual analytics 
method, the human user was replaced by the (linear) K-means algorithm with K = 1, 
2, ..., 10 to analyze each two-dimensional projection, with an optimal choice of K 
determined by the gap statistic 47, which is defined based on a characteristic signature 
in the K-dependence of the within-group variation. The performance of each method 
was measured by the adjusted Rand index R between the computed and the true 
groupings (see the subsection below for definition). 

Rand index. This index measures the similarity between two ways of grouping a given 
set of discrete objects, possibly into different numbers of groups. For a given pair of 
groupings of network nodes, the adjusted Rand index R is defined as the normalized 
fraction of node pairs that are either classified in the same group in both groupings or 
classified in different groups in both groupings 38. The normalization implies that R = 
1 for identical groupings and R ≈ 0 for a pair of random groupings. 
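For concreteness, a common form of the adjusted Rand index (the Hubert-Arabie chance-corrected version) can be computed as in the Python sketch below. Whether Ref. 38 uses exactly this normalization is an assumption here, and the function name is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert-Arabie form) between two groupings."""
    n = len(labels_a)
    # Pairs of nodes placed together in both groupings.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each grouping separately.
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / comb(n, 2)   # chance level for random groupings
    max_index = (same_a + same_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both groupings are trivial
    return (same_both - expected) / (max_index - expected)


identical = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # same grouping, relabeled
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])    # no pair kept together
```

The index depends only on which nodes are grouped together, so relabeling the groups leaves it unchanged, and groupings that agree less than expected by chance give negative values.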

Ranking node properties. For a given node grouping, we seek a two-dimensional 
projection that maximizes $\langle n_k \lVert c_k - c \rVert^2 \rangle_k / \langle \lVert x^{(i)} - c_{k_i} \rVert^2 \rangle_i$, a group separation 
measure similar to that in Eq. (1) but computed for the projected points after the 
groups have been identified. Here $n_k$ denotes the number of nodes in group k, and c 
denotes the center of all the data points. Such a projection plane can be efficiently 
found by a spectral method 48 based on the QR decomposition. The node properties 
are then ranked in the order of increasing angle between their coordinate axes and the 
projection plane. 
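The final ranking step admits a short Python sketch, given here under the assumption that the projection plane is already available as two orthonormal vectors u and v (finding the plane itself is not shown). The length of the projection of the jth coordinate axis onto the plane is $\sqrt{u_j^2 + v_j^2}$, which is the cosine of the angle between that axis and the plane. The function name and the example vectors are hypothetical.

```python
import math


def rank_properties(u, v):
    """Order property indices by increasing angle between each axis and the plane.

    `u` and `v` must be orthonormal vectors spanning the projection plane.
    """
    angles = []
    for j, (uj, vj) in enumerate(zip(u, v)):
        cos_angle = min(1.0, math.hypot(uj, vj))  # guard against rounding above 1
        angles.append((math.acos(cos_angle), j))
    return [j for _, j in sorted(angles)]


# Hypothetical plane in a space of three node properties.
u = [1.0, 0.0, 0.0]
v = [0.0, 0.6, 0.8]
ranking = rank_properties(u, v)
```

Here property 0 lies in the plane and ranks first, followed by property 2 and then property 1, whose axis makes the largest angle with the plane.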

Software. A version of the visual analytics software that implements our method for 
all the networks discussed in this article is available at 
http://purl.oclc.org/net/find_structural_groups 